Learning Theory Approach to Minimum Error Entropy Criterion

Ting Hu, Jun Fan, Qiang Wu, and Ding-Xuan Zhou

Abstract 

We consider the minimum error entropy (MEE) criterion and an empirical risk
minimization learning algorithm in a regression setting. A learning theory approach is
presented for this MEE algorithm, and explicit error bounds are provided in terms of
the approximation ability and capacity of the involved hypothesis space when the MEE
scaling parameter is large. Novel asymptotic analysis is conducted for the generalization
error associated with Rényi's entropy and a Parzen window function, to overcome
technical difficulties arising from the essential differences between the classical least
squares problems and the MEE setting. A semi-norm and the involved symmetrized
least squares error are introduced, which are related to some ranking algorithms.

Keywords: minimum error entropy, learning theory, Rényi's entropy, empirical risk
minimization, approximation error



1 Introduction 

The least squares method is a fundamental computational tool in various fields of science and
engineering. It is well understood mathematically because of the quadratic form of its
loss function, which is well suited to problems involving Gaussian
noise (such as some from linear signal processing). The least squares method has many
extensions or alternatives for different purposes. An information theoretic alternative, the
minimum error entropy (MEE) criterion [5], is based on entropy, a measure of average
information, defined in various forms such as Shannon's entropy and Rényi's entropy.

Rényi's entropy (of order 2) for a random variable $e$ is defined in terms of its probability
density function (pdf) $f_e$ as $H(e) = -\log \mathbb{E}[f_e(e)] = -\log \int (f_e(e))^2\, de$. The pdf is often
unknown. Instead, to estimate the entropy, one needs to learn the density from a sample
$\{e_i\}_{i=1}^m$. A practical way of approximating $f_e$ is Parzen [8] windowing $\frac{1}{mh}\sum_{i=1}^m G\left(\frac{(e-e_i)^2}{2h^2}\right)$
by means of a window function $G : \mathbb{R}_+ \to \mathbb{R}$, a typical example being $G(t) = \exp\{-t\}$,
which corresponds to Gaussian windowing. Then Rényi's entropy can be estimated through its
discretized version, called the empirical Rényi's entropy, defined by

$$\widehat{H} = -\log \frac{1}{m^2 h} \sum_{i=1}^m \sum_{j=1}^m G\left(\frac{(e_i - e_j)^2}{2h^2}\right).$$

Minimizing this computable quantity with e being an error random variable in various ways 
leads to different MEE algorithms. In this paper, we study an MEE learning algorithm for 
regression in an empirical risk minimization (ERM) setting. 
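To make the estimator concrete, here is a minimal Python sketch of the empirical Rényi entropy of a list of error values. It assumes the Gaussian window $G(t) = \exp\{-t\}$ and the normalisation $1/(m^2 h)$ in front of the double sum; the names `gaussian_window` and `empirical_renyi_entropy` are illustrative and are not taken from the paper.

```python
import math
from typing import Callable, Sequence


def gaussian_window(t: float) -> float:
    """Window G(t) = exp(-t), the Gaussian-windowing example."""
    return math.exp(-t)


def empirical_renyi_entropy(
    errors: Sequence[float],
    h: float,
    window: Callable[[float], float] = gaussian_window,
) -> float:
    """Return -log( 1/(m^2 h) * sum_{i,j} G((e_i - e_j)^2 / (2 h^2)) )."""
    m = len(errors)
    total = sum(
        window((ei - ej) ** 2 / (2.0 * h * h))
        for ei in errors
        for ej in errors
    )
    return -math.log(total / (m * m * h))


# Concentrated errors should have lower entropy than spread-out errors.
tight = [0.0, 0.1, -0.1, 0.05]
spread = [0.0, 2.0, -2.0, 4.0]
entropy_tight = empirical_renyi_entropy(tight, h=1.0)
entropy_spread = empirical_renyi_entropy(spread, h=1.0)
```

The quantity depends on the errors only through their pairwise differences, so adding a constant to every error leaves it unchanged. This is the source of the constant offset that appears later in the analysis.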

The regression problem aims at learning a regression function defined on a separable
metric space $X$ (input space for learning) with values in $Y = \mathbb{R}$ (output space). To model
the learning problem, we assume that $\rho$ is a Borel probability measure on $Z := X \times Y$ and
$\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$ is a sample independently drawn according to $\rho$. With a test function $f$
on $X$, the error random variable $e$ on $Z$ for Rényi's entropy takes the form $e = y - f(x)$.
Putting this into the empirical Rényi's entropy $\widehat{H}$ leads to our MEE learning algorithm in
an ERM setting.

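Definition 1. Let $G$ be a continuous function defined on $[0, \infty)$ and $h > 0$. Let $\mathcal{H}$ be a
compact subset of $C(X)$. Then the MEE learning algorithm associated with $\mathcal{H}$ is defined by

$$f_{\mathbf z} = \arg\min_{f \in \mathcal{H}} \left\{ -\log \frac{1}{m^2 h} \sum_{i=1}^m \sum_{j=1}^m G\left(\frac{[(y_i - f(x_i)) - (y_j - f(x_j))]^2}{2h^2}\right) \right\}. \tag{1.1}$$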

The set $\mathcal{H}$ is called the hypothesis space for learning. Its compactness ensures the
existence of a minimizer $f_{\mathbf z}$. Computational methods for solving optimization problem (1.1) and
its applications in signal processing have been described in a vast MEE literature [9, 5, 6, 10].
Asymptotic behaviors of $f_{\mathbf z}$ for small or large MEE scaling parameter $h$ have also been
discussed for different purposes. It has been observed that the MEE criterion has nice
convergence properties when the MEE scaling parameter $h$ becomes large. The first purpose of
this paper is to verify this observation in the ERM setting and show that $f_{\mathbf z}$ approximates
the regression function well with confidence. Here the regression function $f_\rho$ is defined by

$$f_\rho(x) = \int_Y y\, d\rho(y|x), \qquad x \in X,$$

where $\rho(\cdot|x)$ is the conditional distribution of $\rho$ at $x \in X$.
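As an illustration of the ERM scheme (1.1), the sketch below minimizes the MEE objective over a small finite hypothesis set (which is trivially compact). It assumes the Gaussian window $G(t) = \exp\{-t\}$; the helper names `mee_objective` and `make_linear`, the toy data and the candidate slopes are all hypothetical choices made for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[Tuple[float, float]]


def mee_objective(f: Callable[[float], float], sample: Sample, h: float) -> float:
    """Empirical Renyi entropy of the residuals y_i - f(x_i), Gaussian window."""
    m = len(sample)
    residuals = [y - f(x) for x, y in sample]
    total = sum(
        math.exp(-((ri - rj) ** 2) / (2.0 * h * h))
        for ri in residuals
        for rj in residuals
    )
    return -math.log(total / (m * m * h))


def make_linear(slope: float, offset: float = 0.0) -> Callable[[float], float]:
    """Hypothetical candidate x -> slope * x + offset."""
    return lambda x: slope * x + offset


# Noise-free toy data y = 2x + 1; the candidates have no intercept term.
sample = [(float(x), 2.0 * x + 1.0) for x in range(6)]
slopes: List[float] = [0.0, 1.0, 2.0, 3.0]
candidates = [make_linear(a) for a in slopes]
objectives = [mee_objective(f, sample, h=5.0) for f in candidates]
best_slope = slopes[objectives.index(min(objectives))]
```

In this toy example the minimizer recovers the slope of the data even though no candidate matches the intercept, because the objective is unchanged when a constant is added to a candidate function.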

Our mathematical analysis for the convergence of $f_{\mathbf z}$ to $f_\rho$ is stated in terms of the
approximation ability of the hypothesis space $\mathcal{H}$ and its capacity. The approximation ability
is measured by the approximation error. We assume $f_\rho \in L^2_{\rho_X}$, with $\rho_X$ the marginal
distribution of $\rho$ on $X$.



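Definition 2. Define a semi-norm $|||\cdot|||_{L^2_{\rho_X}}$ on the space $L^2_{\rho_X}$ as

$$|||f|||_{L^2_{\rho_X}} = \min_{c \in \mathbb{R}} \|f - c\|_{L^2_{\rho_X}}, \qquad f \in L^2_{\rho_X}. \tag{1.2}$$

The approximation error of the pair $(\mathcal{H}, f_\rho)$ is defined by

$$\mathcal{D}_{\mathcal H}(f_\rho) = \inf_{f \in \mathcal{H}} |||f - f_\rho|||^2_{L^2_{\rho_X}} = \inf_{f \in \mathcal{H}} \min_{c \in \mathbb{R}} \|f - f_\rho - c\|^2_{L^2_{\rho_X}}. \tag{1.3}$$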

The minimum in (1.2) is achieved by the constant $c^* = \int_X f(x)\, d\rho_X$ and in (1.3) by the
constant $\int_X (f(x) - f_\rho(x))\, d\rho_X$. The approximation error for least squares ERM regression
was studied in [11]. An essential difference between that and the approximation error here
is an additional constant function, which is similar to an offset in support vector machines.
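The semi-norm can be checked numerically on a finite sample. The sketch below assumes that $\rho_X$ is replaced by the uniform measure on a list of function values, in which case the minimizing constant is the mean and the semi-norm is the population standard deviation. The names `l2_norm` and `semi_norm` are illustrative.

```python
import math
import statistics
from typing import Sequence


def l2_norm(values: Sequence[float]) -> float:
    """L2 norm with respect to the uniform measure on the listed points."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def semi_norm(values: Sequence[float]) -> float:
    """min over constants c of ||f - c||, attained at c* = mean of f."""
    c_star = sum(values) / len(values)
    return l2_norm([v - c_star for v in values])


values = [1.0, 2.0, 3.0, 4.0]
value_semi_norm = semi_norm(values)

# No other constant on a coarse grid does better than the mean.
grid_best = min(
    l2_norm([v - c / 10.0 for v in values]) for c in range(0, 51)
)
```

Since the semi-norm ignores constant shifts, two functions that differ by a constant are at distance zero from each other.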



The capacity of the hypothesis space $\mathcal{H}$ is measured by covering numbers in this paper.

Definition 3. For $\varepsilon > 0$, the covering number $\mathcal{N}(\mathcal{H}, \varepsilon)$ is defined to be the smallest integer
$l \in \mathbb{N}$ such that there exist $l$ disks with radius $\varepsilon$ in $C(X)$ covering the set $\mathcal{H}$. We shall assume
that for some constants $p > 0$ and $A_p > 0$, there holds

$$\log \mathcal{N}(\mathcal{H}, \varepsilon) \le A_p \varepsilon^{-p}, \qquad \forall \varepsilon > 0. \tag{1.4}$$

The asymptotic behavior (1.4) of the covering numbers is typical in learning theory. It
is satisfied by balls of Sobolev spaces on $X \subset \mathbb{R}^n$ and reproducing kernel Hilbert spaces
associated with Sobolev smooth kernels. See [2, 16, 17, 14].

Throughout the paper we assume that

$$G \in C^2[0,\infty), \quad G'_+(0) = -1, \quad \|G''\|_\infty < \infty, \quad \|tG'(t^2/2)\|_\infty := \sup_{t \ge 0} |tG'(t^2/2)| < \infty.$$

Note that the special example $G(t) = \exp\{-t\}$ for Gaussian windowing satisfies the
above assumption with $\|G''\|_\infty = 1$ and $\|tG'(t^2/2)\|_\infty = e^{-1/2} < \infty$.
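The constants quoted for the Gaussian window can be sanity-checked numerically. The sketch below evaluates the derivatives of $G(t) = \exp\{-t\}$ on a finite grid (the grid range and step are arbitrary choices for this example) and compares the suprema with the stated values.

```python
import math


def g_prime(t: float) -> float:
    """First derivative of G(t) = exp(-t)."""
    return -math.exp(-t)


def g_second(t: float) -> float:
    """Second derivative of G(t) = exp(-t)."""
    return math.exp(-t)


# Grid on [0, 10] with step 0.001; it contains t = 0 and t = 1 exactly.
grid = [k / 1000.0 for k in range(0, 10001)]

sup_second = max(abs(g_second(t)) for t in grid)               # attained at t = 0
sup_weighted = max(abs(t * g_prime(t * t / 2.0)) for t in grid)  # attained at t = 1
```

The function $t \mapsto t e^{-t^2/2}$ has derivative $(1 - t^2) e^{-t^2/2}$, so its maximum on $[0, \infty)$ is at $t = 1$, which gives the value $e^{-1/2}$.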

We also assume $\int_Z y^4\, d\rho < \infty$ and $f_\rho \in L^\infty_{\rho_X}$. The following error bound for (1.1) with
large $h$ will be proved in Section 5.

Theorem 1. Assume covering number condition (1.4) with some $p > 0$. Define $f_{\mathbf z}$ by (1.1)
with $h \ge 1$ and $m > 1$. Then for any $0 < \eta \le 1$ and $0 < \delta < 1$, with confidence $1 - \delta$ we
have

- s I (f 4 + S) ^ + (i + (ls) 

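where $\widetilde{C}_{\mathcal H}$ is a constant independent of $m$, $\delta$ or $h$ (depending on $\mathcal{H}$). In particular, if
$h = m^{\frac{1}{4+3p}}$, we have

$$|||f_{\mathbf z} - f_\rho|||^2_{L^2_{\rho_X}} \le \frac{3\widetilde{C}_{\mathcal H}}{\eta} \left(\frac{1}{m}\right)^{\frac{2}{4+3p}} \log\frac{2}{\delta} + (1+\eta)\, \mathcal{D}_{\mathcal H}(f_\rho). \tag{1.6}$$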

Remark 1. In Theorem 1 we use a parameter $\eta > 0$ in the error bounds (1.5) and (1.6)
to show that the bounds consist of two terms, one of which is essentially the approximation
error $\mathcal{D}_{\mathcal H}(f_\rho)$ since $\eta$ can be arbitrarily small. The reader can simply set $\eta = 1$ to get the
main ideas of our analysis.

When $f_\rho + c_\rho \in \mathcal{H}$ for some constant $c_\rho \in \mathbb{R}$, we know that $\mathcal{D}_{\mathcal H}(f_\rho) = 0$. In this case, the
choice $\eta = 1$ in Theorem 1 yields the following learning rate.

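Corollary 1. Assume (1.4) with some $p > 0$ and $f_\rho + c_\rho \in \mathcal{H}$ for some constant $c_\rho \in \mathbb{R}$.
Define $f_{\mathbf z}$ by (1.1) with $h = m^{\frac{1}{4+3p}}$ and $m > 1$. Then with confidence $1 - \delta$ we have

$$|||f_{\mathbf z} - f_\rho|||^2_{L^2_{\rho_X}} = \left\| f_{\mathbf z} - f_\rho - \int_X (f_{\mathbf z}(x) - f_\rho(x))\, d\rho_X \right\|^2_{L^2_{\rho_X}} \le 3\widetilde{C}_{\mathcal H} \left(\frac{1}{m}\right)^{\frac{2}{4+3p}} \log\frac{2}{\delta}.$$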

A special example of the hypothesis space is a ball of a Sobolev space $H^s(X)$ with index
$s > \frac{n}{2}$ on a domain $X \subset \mathbb{R}^n$, which satisfies (1.4) with $p = \frac{n}{s}$. When $s$ is large enough, the
positive index $\frac{n}{s}$ can be arbitrarily small. Then the power exponent of the following learning
rate can be arbitrarily close to $\frac{1}{2}$.

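Example 1. Let $X$ be a bounded domain of $\mathbb{R}^n$ with Lipschitz boundary. If $f_\rho \in H^s(X)$ for
some $s > \frac{n}{2}$ and $\mathcal{H} = \{f \in H^s(X) : \|f\|_{H^s(X)} \le R\}$ with $R \ge \|f_\rho\|_{H^s(X)}$, then for $f_{\mathbf z}$ defined
by (1.1) with $h = m^{\frac{1}{4+3n/s}}$ and $m > 1$, with confidence $1 - \delta$ we have

$$|||f_{\mathbf z} - f_\rho|||^2_{L^2_{\rho_X}} \le 3\widetilde{C}_{\mathcal H} \left(\frac{1}{m}\right)^{\frac{2}{4+3n/s}} \log\frac{2}{\delta}.$$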
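The dependence of the rate on the capacity index can be tabulated with a few lines of Python. The sketch assumes the reading $h = m^{1/(4+3p)}$ for the scaling parameter and $m^{-2/(4+3p)}$ for the rate, with $p = n/s$ for a Sobolev ball; the function names are illustrative only.

```python
def sobolev_capacity_index(n: int, s: float) -> float:
    """Capacity index p = n / s for a ball of H^s(X), X in R^n, s > n/2."""
    if s <= n / 2.0:
        raise ValueError("the Sobolev index must satisfy s > n/2")
    return n / s


def mee_scaling(m: int, p: float) -> float:
    """Scaling parameter h = m^(1 / (4 + 3p))."""
    return m ** (1.0 / (4.0 + 3.0 * p))


def rate_exponent(p: float) -> float:
    """Exponent theta in the rate m^(-theta), theta = 2 / (4 + 3p)."""
    return 2.0 / (4.0 + 3.0 * p)


# Smoother target classes (larger s) push the exponent towards 1/2.
exponents = [rate_exponent(sobolev_capacity_index(2, s)) for s in (2.0, 20.0, 200.0)]
```

With this choice of $h$ the term $h^{-2}$ equals $m^{-2/(4+3p)}$, which is the quantity that appears in the stated rate.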
Our error analysis for algorithm (1.1) is based on asymptotic behaviors of the involved
generalization error associated with the window function $G$. The Taylor expansion $G(t) \approx
G(0) + G'_+(0)t$ leads us to consider the following symmetrized least squares error which has
appeared in the literature of ranking algorithms [3, 1].

Definition 4. The symmetrized least squares error is defined on the space $L^2_{\rho_X}$ by

$$\mathcal{E}^{sls}(f) = \int_Z \int_Z \left[(y - f(x)) - (v - f(u))\right]^2 d\rho(x,y)\, d\rho(u,v), \qquad f \in L^2_{\rho_X}. \tag{1.7}$$

The second purpose of this paper is to reveal the following relation between the
symmetrized least squares error and the square of the semi-norm $|||\cdot|||_{L^2_{\rho_X}}$, to be proved in the
next section. We expect that this result can be applied to error analysis of some ranking
algorithms.

Theorem 2. If $\int_Z y^2\, d\rho < \infty$, then

$$\mathcal{E}^{sls}(f) = 2\, |||f - f_\rho|||^2_{L^2_{\rho_X}} + 2 \int_Z [y - f_\rho(x)]^2\, d\rho, \qquad \forall f \in L^2_{\rho_X}. \tag{1.8}$$
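The identity in Theorem 2 can be verified exactly on a small discrete distribution. In the sketch below, $\rho$ is a hypothetical measure given by five atoms `(x, y, weight)` and `f` is an arbitrary test function; both sides of (1.8) are evaluated by direct summation.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Atom = Tuple[int, float, float]  # (x, y, probability weight)

# Hypothetical discrete measure rho on Z = X x Y; the weights sum to 1.
atoms: List[Atom] = [(0, 0.0, 0.2), (0, 2.0, 0.2), (1, 1.0, 0.3), (1, 4.0, 0.1), (2, -1.0, 0.2)]


def regression_function(atoms: List[Atom]) -> Dict[int, float]:
    """Conditional mean f_rho(x) = E[y | x] under the discrete measure."""
    mass: Dict[int, float] = defaultdict(float)
    first_moment: Dict[int, float] = defaultdict(float)
    for x, y, w in atoms:
        mass[x] += w
        first_moment[x] += w * y
    return {x: first_moment[x] / mass[x] for x in mass}


def symmetrized_ls_error(f: Callable[[int], float], atoms: List[Atom]) -> float:
    """Left side of (1.8): double integral of [(y - f(x)) - (v - f(u))]^2."""
    return sum(
        w1 * w2 * ((y - f(x)) - (v - f(u))) ** 2
        for x, y, w1 in atoms
        for u, v, w2 in atoms
    )


def theorem2_right_hand_side(f: Callable[[int], float], atoms: List[Atom]) -> float:
    """Right side of (1.8): 2 |||f - f_rho|||^2 + 2 * int (y - f_rho(x))^2."""
    f_rho = regression_function(atoms)
    mean = sum(w * (f(x) - f_rho[x]) for x, _, w in atoms)
    semi_norm_sq = sum(w * (f(x) - f_rho[x] - mean) ** 2 for x, _, w in atoms)
    noise = sum(w * (y - f_rho[x]) ** 2 for x, y, w in atoms)
    return 2.0 * semi_norm_sq + 2.0 * noise


def f(x: int) -> float:
    return 0.5 * x * x - 1.0


lhs = symmetrized_ls_error(f, atoms)
rhs = theorem2_right_hand_side(f, atoms)
```

The two numbers agree up to rounding for any choice of `f`, since the identity is a variance decomposition of the residual $y - f(x)$.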



2 Information Error and Its Asymptotic Analysis 

In this section we study a functional called the information error, or generalization error
associated with the window function $G$, defined over the space of measurable functions on $X$
as

$$\mathcal{E}^{(h)}(f) = \int_Z \int_Z -h^2\, G\left(\frac{[(y - f(x)) - (v - f(u))]^2}{2h^2}\right) d\rho(x,y)\, d\rho(u,v),$$

and investigate its asymptotic behavior as $h$ tends to infinity.
Denote a constant $C_\rho$ associated with the measure $\rho$ as

$$C_\rho = \int_Z [y - f_\rho(x)]^2\, d\rho.$$



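Theorem 3. If $G \in C^2[0,\infty)$ with $\|G''\|_\infty = \sup_{0<t<\infty} |G''(t)| < \infty$ and $\int_Z y^4\, d\rho < \infty$, then
for any essentially bounded measurable function $f$ on $X$, we have

$$\left| \mathcal{E}^{(h)}(f) + h^2 G(0) + G'_+(0) C_\rho + G'_+(0)\, |||f - f_\rho|||^2_{L^2_{\rho_X}} \right| \le \frac{64\|G''\|_\infty}{h^2} \left( \int_Z y^4\, d\rho + \|f\|^4_\infty \right). \tag{2.1}$$

In particular,

$$\left| \mathcal{E}^{(h)}(f) + h^2 G(0) + G'_+(0) C_\rho + G'_+(0)\, |||f - f_\rho|||^2_{L^2_{\rho_X}} \right| \le \frac{C'_{\mathcal H}}{h^2}, \qquad \forall f \in \mathcal{H},$$

where $C'_{\mathcal H}$ is the constant depending on $\rho$, $G$ and $\mathcal{H}$ given by

$$C'_{\mathcal H} = 64\|G''\|_\infty \left( \int_Z y^4\, d\rho + \Big(\sup_{f \in \mathcal{H}} \|f\|_\infty\Big)^4 \right).$$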

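Proof. By the Taylor expansion bound $|G(t) - G(0) - G'_+(0)t| \le \frac{\|G''\|_\infty}{2} t^2$ for $t \ge 0$, we know that

$$\left| \mathcal{E}^{(h)}(f) + h^2 G(0) + \int_Z \int_Z G'_+(0)\, \frac{[(y - f(x)) - (v - f(u))]^2}{2}\, d\rho(x,y)\, d\rho(u,v) \right|$$

$$\le \frac{\|G''\|_\infty}{8h^2} \int_Z \int_Z \left[(y - f(x)) - (v - f(u))\right]^4 d\rho(x,y)\, d\rho(u,v)$$

$$\le \frac{4\|G''\|_\infty}{h^2} \int_Z [y - f(x)]^4\, d\rho \le \frac{64\|G''\|_\infty}{h^2} \left( \int_Z y^4\, d\rho + \|f\|^4_\infty \right).$$

This together with Theorem 2 proves bound (2.1) and hence our conclusion.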



□ 
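The $O(h^{-2})$ behavior in Theorem 3 can be observed numerically. The sketch below assumes the Gaussian window $G(t) = \exp\{-t\}$ (so $G(0) = 1$, $G'_+(0) = -1$, $\|G''\|_\infty = 1$) and a hypothetical five-atom measure. It uses Theorem 2 to write $C_\rho + |||f - f_\rho|||^2$ as half the symmetrized least squares error, and `math.expm1` to avoid cancellation in $h^2(1 - e^{-t})$.

```python
import math
from typing import Callable, List, Tuple

Atom = Tuple[int, float, float]  # (x, y, probability weight)

# Hypothetical discrete measure rho; the weights sum to 1.
atoms: List[Atom] = [(0, 0.0, 0.2), (0, 2.0, 0.2), (1, 1.0, 0.3), (1, 4.0, 0.1), (2, -1.0, 0.2)]


def f(x: int) -> float:
    return 0.5 * x


def taylor_gap(f: Callable[[int], float], atoms: List[Atom], h: float) -> float:
    """|E^(h)(f) + h^2 G(0) - (1/2) E^sls(f)| for G(t) = exp(-t)."""
    shifted = 0.0   # E^(h)(f) + h^2 G(0), accumulated term by term
    half_sls = 0.0  # (1/2) E^sls(f) = C_rho + |||f - f_rho|||^2 by Theorem 2
    for x, y, w1 in atoms:
        for u, v, w2 in atoms:
            gap_sq = ((y - f(x)) - (v - f(u))) ** 2
            shifted += w1 * w2 * (-h * h * math.expm1(-gap_sq / (2.0 * h * h)))
            half_sls += w1 * w2 * gap_sq / 2.0
    return abs(shifted - half_sls)


def theorem3_bound(f: Callable[[int], float], atoms: List[Atom], h: float) -> float:
    """Right side of (2.1) with ||G''||_inf = 1."""
    fourth_moment = sum(w * y ** 4 for _, y, w in atoms)
    sup_f = max(abs(f(x)) for x, _, _ in atoms)
    return 64.0 * (fourth_moment + sup_f ** 4) / (h * h)


scales = [1.0, 2.0, 4.0, 8.0]
gaps = [taylor_gap(f, atoms, h) for h in scales]
bounds = [theorem3_bound(f, atoms, h) for h in scales]
```

In this example the gap shrinks as $h$ grows and stays below the bound at every scale, in line with the statement that the information error behaves like a shifted least squares type error for large $h$.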



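Applying Theorem 3 to a function $f \in \mathcal{H}$ and $f_\rho \in L^\infty_{\rho_X}$ yields the following relation on
the excess generalization error $\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)$.

Corollary 2. Under the condition of Theorem 3, if $f_\rho \in L^\infty_{\rho_X}$, we have

$$\left| \mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho) + G'_+(0)\, |||f - f_\rho|||^2_{L^2_{\rho_X}} \right| \le \frac{C''_{\mathcal H}}{h^2}, \qquad \forall f \in \mathcal{H},$$

where

$$C''_{\mathcal H} = C'_{\mathcal H} + 64\|G''\|_\infty \left( \int_Z y^4\, d\rho + \|f_\rho\|^4_\infty \right).$$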



We end this section by proving Theorem 2 stated in the introduction.

Proof of Theorem 2. By the definition of the regression function, we have

$$\mathcal{E}^{sls}(f) = \int_Z \int_Z \left[(y - f(x)) - (v - f(u))\right]^2 d\rho(x,y)\, d\rho(u,v)$$

$$= \int_Z \left\{ \int_X \left[f_\rho(x) - f(x) - (v - f(u))\right]^2 d\rho_X(x) + \int_Z [y - f_\rho(x)]^2\, d\rho(x,y) \right\} d\rho(u,v)$$

$$= \int_X \int_X \left[f_\rho(x) - f(x) - (f_\rho(u) - f(u))\right]^2 d\rho_X(x)\, d\rho_X(u) + \int_X \int_Z [v - f_\rho(u)]^2\, d\rho(u,v)\, d\rho_X(x) + \int_Z \int_Z [y - f_\rho(x)]^2\, d\rho(x,y)\, d\rho(u,v)$$

$$= \int_X \int_X \left[f(x) - f_\rho(x) - (f(u) - f_\rho(u))\right]^2 d\rho_X(x)\, d\rho_X(u) + 2 \int_Z [y - f_\rho(x)]^2\, d\rho.$$

Let $c^*$ be the best approximation in $L^2_{\rho_X}$ of $f - f_\rho$ from the subspace of the constant
functions. That is,

$$c^* = \arg\min_{c \in \mathbb{R}} \|f - f_\rho - c\|_{L^2_{\rho_X}}.$$

Then the function $f - f_\rho - c^*$ is orthogonal to any constant function in the Hilbert space
$L^2_{\rho_X}$, and $\|f - f_\rho - c^*\|_{L^2_{\rho_X}} = |||f - f_\rho|||_{L^2_{\rho_X}}$. It follows that

$$\int_X \left(f(x) - f_\rho(x) - c\right)^2 d\rho_X(x) = |c - c^*|^2 + |||f - f_\rho|||^2_{L^2_{\rho_X}}, \qquad \forall c \in \mathbb{R}.$$

With $c = f(u) - f_\rho(u) \in \mathbb{R}$, this implies that for any fixed $u \in X$, there holds

$$\int_X \left[f(x) - f_\rho(x) - (f(u) - f_\rho(u))\right]^2 d\rho_X(x) = |f(u) - f_\rho(u) - c^*|^2 + |||f - f_\rho|||^2_{L^2_{\rho_X}}.$$

Hence

$$\int_X \int_X \left[f(x) - f_\rho(x) - (f(u) - f_\rho(u))\right]^2 d\rho_X(x)\, d\rho_X(u) = \int_X |f(u) - f_\rho(u) - c^*|^2\, d\rho_X(u) + |||f - f_\rho|||^2_{L^2_{\rho_X}} = 2\, |||f - f_\rho|||^2_{L^2_{\rho_X}}.$$

Then the desired equality follows. This proves Theorem 2. □

3 Error Decomposition for the ERM Algorithm 

Error decomposition has been a standard technique to analyze least squares ERM regression
algorithms [2, 12, 15]. It decomposes the error $\|f_{\mathbf z} - f_\rho\|^2_{L^2_{\rho_X}}$ into the sum of $\mathcal{E}^{ls}(f_{\mathbf z}) - \mathcal{E}^{ls}(f_{\mathcal H})$
(sample error) and $\mathcal{E}^{ls}(f_{\mathcal H}) - \mathcal{E}^{ls}(f_\rho) = \|f_{\mathcal H} - f_\rho\|^2_{L^2_{\rho_X}}$ (approximation error), where $\mathcal{E}^{ls}(f) =
\int_Z (f(x) - y)^2\, d\rho$ and $f_{\mathcal H}$ is a minimizer (called the target function) of $\mathcal{E}^{ls}(f)$ in $\mathcal{H}$. A technical
difficulty arises for the error decomposition of ERM algorithm (1.1) since there might be two
ways to define a target function in $\mathcal{H}$, one to minimize the information error and the other
to minimize the distance to $f_\rho$ under the semi-norm $|||\cdot|||_{L^2_{\rho_X}}$. These possible candidates for
the target function are defined as

$$f_{\mathcal H} := \arg\min_{f \in \mathcal{H}} \mathcal{E}^{(h)}(f), \tag{3.1}$$

$$f_{approx} := \arg\min_{f \in \mathcal{H}} |||f - f_\rho|||_{L^2_{\rho_X}} = \arg\min_{f \in \mathcal{H}} \min_{c \in \mathbb{R}} \|f - f_\rho - c\|_{L^2_{\rho_X}}. \tag{3.2}$$

The first technical novelty of this paper is to show that when the MEE scaling parameter $h$
is large, these two functions are actually very close.

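Theorem 4. Under the condition of Corollary 2, if $G'_+(0) < 0$, then

$$\mathcal{E}^{(h)}(f_{approx}) \le \mathcal{E}^{(h)}(f_{\mathcal H}) + \frac{2C''_{\mathcal H}}{h^2}$$

and

$$|||f_{\mathcal H} - f_\rho|||^2_{L^2_{\rho_X}} \le |||f_{approx} - f_\rho|||^2_{L^2_{\rho_X}} + \frac{2C''_{\mathcal H}}{-G'_+(0) h^2}.$$

Proof. By Corollary 2 and the definitions of $f_{\mathcal H}$ and $f_{approx}$, we have

$$\mathcal{E}^{(h)}(f_{\mathcal H}) - \mathcal{E}^{(h)}(f_\rho) \le \mathcal{E}^{(h)}(f_{approx}) - \mathcal{E}^{(h)}(f_\rho) \le -G'_+(0)\, |||f_{approx} - f_\rho|||^2_{L^2_{\rho_X}} + \frac{C''_{\mathcal H}}{h^2}$$

$$\le -G'_+(0)\, |||f_{\mathcal H} - f_\rho|||^2_{L^2_{\rho_X}} + \frac{C''_{\mathcal H}}{h^2} \le \mathcal{E}^{(h)}(f_{\mathcal H}) - \mathcal{E}^{(h)}(f_\rho) + \frac{2C''_{\mathcal H}}{h^2}$$

$$\le -G'_+(0)\, |||f_{approx} - f_\rho|||^2_{L^2_{\rho_X}} + \frac{3C''_{\mathcal H}}{h^2}.$$

Then the desired inequalities follow. □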

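Corollary 2 actually yields the following error decomposition for our algorithm.

Lemma 1. Under the condition of Corollary 2, if $G'_+(0) < 0$, then

$$|||f_{\mathbf z} - f_\rho|||^2_{L^2_{\rho_X}} \le \frac{1}{-G'_+(0)} \left\{ \mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_{\mathcal H}) \right\} + |||f_{approx} - f_\rho|||^2_{L^2_{\rho_X}} + \frac{2C''_{\mathcal H}}{-G'_+(0) h^2}. \tag{3.3}$$

Proof. By Corollary 2,

$$-G'_+(0)\, |||f_{\mathbf z} - f_\rho|||^2_{L^2_{\rho_X}} \le \mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_\rho) + \frac{C''_{\mathcal H}}{h^2} \le \left\{ \mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_{\mathcal H}) \right\} + \mathcal{E}^{(h)}(f_{\mathcal H}) - \mathcal{E}^{(h)}(f_\rho) + \frac{C''_{\mathcal H}}{h^2}.$$

Since $f_{approx} \in \mathcal{H}$, the definition of $f_{\mathcal H}$ tells us that

$$\mathcal{E}^{(h)}(f_{\mathcal H}) - \mathcal{E}^{(h)}(f_\rho) \le \mathcal{E}^{(h)}(f_{approx}) - \mathcal{E}^{(h)}(f_\rho).$$

Applying Corollary 2 to the above bound implies

$$-G'_+(0)\, |||f_{\mathbf z} - f_\rho|||^2_{L^2_{\rho_X}} \le \left\{ \mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_{\mathcal H}) \right\} - G'_+(0)\, |||f_{approx} - f_\rho|||^2_{L^2_{\rho_X}} + \frac{2C''_{\mathcal H}}{h^2}.$$

Then the desired error decomposition (3.3) follows. □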

4 Sample Error Estimates 

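In this section, we estimate the sample error $\mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_{\mathcal H})$. In this step, we demonstrate
our second technical novelty. Define the empirical information error for measurable functions
$f$ on $X$ as

$$\mathcal{E}^{(h)}_{\mathbf z}(f) = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \ne i} -h^2\, G\left(\frac{[(y_i - f(x_i)) - (y_j - f(x_j))]^2}{2h^2}\right).$$

Then

$$\mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_{\mathcal H}) = \mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}_{\mathbf z}(f_{\mathbf z}) + \mathcal{E}^{(h)}_{\mathbf z}(f_{\mathbf z}) - \mathcal{E}^{(h)}_{\mathbf z}(f_{\mathcal H}) + \mathcal{E}^{(h)}_{\mathbf z}(f_{\mathcal H}) - \mathcal{E}^{(h)}(f_{\mathcal H}).$$

By the definition of $f_{\mathbf z}$, we have $\mathcal{E}^{(h)}_{\mathbf z}(f_{\mathbf z}) - \mathcal{E}^{(h)}_{\mathbf z}(f_{\mathcal H}) \le 0$. Hence

$$\mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_{\mathcal H}) \le \mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}_{\mathbf z}(f_{\mathbf z}) + \mathcal{E}^{(h)}_{\mathbf z}(f_{\mathcal H}) - \mathcal{E}^{(h)}(f_{\mathcal H}) = S_1 + S_2, \tag{4.1}$$

where

$$S_1 := \left[ \mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_\rho) \right] - \left[ \mathcal{E}^{(h)}_{\mathbf z}(f_{\mathbf z}) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho) \right],$$

$$S_2 := \left[ \mathcal{E}^{(h)}_{\mathbf z}(f_{\mathcal H}) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho) \right] - \left[ \mathcal{E}^{(h)}(f_{\mathcal H}) - \mathcal{E}^{(h)}(f_\rho) \right].$$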
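The empirical information error is a U-statistic over ordered pairs of distinct sample points. A minimal Python sketch follows, assuming the Gaussian window $G(t) = \exp\{-t\}$ and the normalisation $1/(m(m-1))$ over pairs $i \ne j$; the function name and the toy data are illustrative.

```python
import math
from typing import Callable, Sequence, Tuple


def empirical_information_error(
    f: Callable[[float], float],
    sample: Sequence[Tuple[float, float]],
    h: float,
) -> float:
    """Average of -h^2 * G(gap^2 / (2 h^2)) over ordered pairs i != j."""
    m = len(sample)
    residuals = [y - f(x) for x, y in sample]
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                gap_sq = (residuals[i] - residuals[j]) ** 2
                total += -h * h * math.exp(-gap_sq / (2.0 * h * h))
    return total / (m * (m - 1))


# Toy data y = x + 3: the identity map fits up to a constant offset.
shifted_sample = [(float(x), float(x) + 3.0) for x in range(5)]
exact_up_to_constant = empirical_information_error(lambda x: x, shifted_sample, h=2.0)
poor_fit = empirical_information_error(lambda x: 0.0, shifted_sample, h=2.0)
```

The smallest possible value is $-h^2 G(0)$, reached when all residuals coincide. Minimizing this quantity over a hypothesis set selects the same function as maximizing the double sum inside the logarithm in (1.1), because the diagonal terms contribute the constant $G(0)$.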

We use Hoeffding's probability inequality for U-statistics [7] to bound $S_1$ and $S_2$.

Lemma 2. If $U$ is a symmetric real-valued function on $Z \times Z$ satisfying $a \le U(z, z') \le b$
almost surely and $\mathrm{var}(U) = \sigma^2$, then for any $\varepsilon > 0$,



Prob 



4<x 2 + (4/3) (6 - a)e j ' 



9 



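Lemma 3. For any $0 < \delta < 1$, with confidence $1 - \frac{\delta}{2}$, there holds

$$S_2 \le \|tG'(t^2/2)\|_\infty \log\frac{2}{\delta}\, h \left\{ \frac{6\|f_{\mathcal H} - f_\rho\|_\infty}{m-1} + \frac{4\, |||f_{\mathcal H} - f_\rho|||_{L^2_{\rho_X}}}{\sqrt{m-1}} \right\}.$$

Proof. Let $U$ be the symmetric real-valued function on $Z \times Z$ defined in terms of two variables
$z = (x, y)$, $z' = (u, v) \in Z$ as

$$U(z, z') = -h^2 G\left(\frac{[(y - f_{\mathcal H}(x)) - (v - f_{\mathcal H}(u))]^2}{2h^2}\right) + h^2 G\left(\frac{[(y - f_\rho(x)) - (v - f_\rho(u))]^2}{2h^2}\right).$$

Define a function $g$ on $\mathbb{R}$ by

$$g(t) = G(t^2/2), \qquad t \in \mathbb{R}. \tag{4.2}$$

We see that $g \in C^2(\mathbb{R})$, $g(0) = G(0)$, $g'(t) = tG'(t^2/2)$ with $g'(0) = 0$. Moreover,

$$U(z, z') = -h^2 g\left(\frac{(y - f_{\mathcal H}(x)) - (v - f_{\mathcal H}(u))}{h}\right) + h^2 g\left(\frac{(y - f_\rho(x)) - (v - f_\rho(u))}{h}\right).$$

It follows that

$$|U(z, z')| \le h^2 \|g'\|_\infty \frac{\left| [f_{\mathcal H}(u) - f_\rho(u)] - [f_{\mathcal H}(x) - f_\rho(x)] \right|}{h}.$$

This tells us that

$$|U(z, z')| \le 2\|tG'(t^2/2)\|_\infty \|f_{\mathcal H} - f_\rho\|_\infty h$$

almost surely. Moreover, putting the constant

$$c^* = \arg\min_{c \in \mathbb{R}} \|f_{\mathcal H} - f_\rho - c\|_{L^2_{\rho_X}}$$

into

$$[f_{\mathcal H}(u) - f_\rho(u)] - [f_{\mathcal H}(x) - f_\rho(x)] = [f_{\mathcal H}(u) - f_\rho(u) - c^*] - [f_{\mathcal H}(x) - f_\rho(x) - c^*],$$

we find that

$$\mathrm{var}(U) \le \mathbb{E}(U^2) \le 4\|tG'(t^2/2)\|^2_\infty \|f_{\mathcal H} - f_\rho - c^*\|^2_{L^2_{\rho_X}} h^2 = 4\|tG'(t^2/2)\|^2_\infty\, |||f_{\mathcal H} - f_\rho|||^2_{L^2_{\rho_X}} h^2.$$

Note that $\mathbb{E}U = \mathcal{E}^{(h)}(f_{\mathcal H}) - \mathcal{E}^{(h)}(f_\rho)$ and $\frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \ne i} U(z_i, z_j) = \mathcal{E}^{(h)}_{\mathbf z}(f_{\mathcal H}) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)$.
Then by Lemma 2 we have

$$\mathrm{Prob}\{S_2 > \varepsilon\} \le \exp\left\{ -\frac{(m-1)\varepsilon^2}{16\|tG'(t^2/2)\|^2_\infty\, |||f_{\mathcal H} - f_\rho|||^2_{L^2_{\rho_X}} h^2 + 6\|tG'(t^2/2)\|_\infty \|f_{\mathcal H} - f_\rho\|_\infty h \varepsilon} \right\}.$$

By setting the above probability bound to be $\delta/2$ and solving the corresponding quadratic
equation, we know that with confidence at least $1 - \frac{\delta}{2}$, we have

$$S_2 \le \frac{6\|tG'(t^2/2)\|_\infty \|f_{\mathcal H} - f_\rho\|_\infty h}{m-1} \log\frac{2}{\delta} + \frac{4\|tG'(t^2/2)\|_\infty\, |||f_{\mathcal H} - f_\rho|||_{L^2_{\rho_X}} h}{\sqrt{m-1}} \log\frac{2}{\delta}.$$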

Then our desired estimate follows. □ 
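The step of setting a tail bound of the form $\exp\{-(m-1)\varepsilon^2/(a + b\varepsilon)\}$ equal to $\delta/2$ and solving for $\varepsilon$ can be sketched as follows. Here `a` and `b` stand for the variance-type and range-type constants in the denominator; the numerical values used below are arbitrary, and the function names are illustrative.

```python
import math


def tail_bound(eps: float, m: int, a: float, b: float) -> float:
    """Bernstein-type bound exp(-(m - 1) eps^2 / (a + b eps))."""
    return math.exp(-(m - 1) * eps * eps / (a + b * eps))


def critical_epsilon(m: int, a: float, b: float, delta: float) -> float:
    """Positive root of (m - 1) eps^2 - b L eps - a L = 0 with L = log(2/delta)."""
    log_term = math.log(2.0 / delta)
    discriminant = b * b * log_term * log_term + 4.0 * (m - 1) * a * log_term
    return (b * log_term + math.sqrt(discriminant)) / (2.0 * (m - 1))


def simple_upper_estimate(m: int, a: float, b: float, delta: float) -> float:
    """Upper estimate b L / (m - 1) + sqrt(a L / (m - 1)) of the root."""
    log_term = math.log(2.0 / delta)
    return b * log_term / (m - 1) + math.sqrt(a * log_term / (m - 1))


m, a, b, delta = 200, 4.0, 3.0, 0.05
eps_star = critical_epsilon(m, a, b, delta)
eps_upper = simple_upper_estimate(m, a, b, delta)
```

The upper estimate follows from $\sqrt{u + v} \le \sqrt{u} + \sqrt{v}$ applied to the discriminant, and it separates a term of order $1/(m-1)$ from a term of order $1/\sqrt{m-1}$.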

Estimating $S_1$ is more involved. One possible way to get tight bounds is by Hoeffding's
decomposition, as done for ranking algorithms in [3]. We take another way, applying a
ratio probability inequality. Since $f_{\mathbf z}$ depends on the sample $\mathbf{z}$, we use covering numbers of
the set $\mathcal{H}$ to give our bounds.

A key difficulty in estimating $S_1$ is caused by an essential difference between the
information error $\mathcal{E}^{(h)}(f)$ associated with the entropy and the generalization error associated with
the least squares loss: $\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho)$ is not equal to $\|f_j - f_\rho\|^2_{L^2_{\rho_X}}$. The second technical
novelty of this paper is to apply Corollary 2 to bound a variance term ($\sigma^2$ in Lemma 2) by
the following inequality:

$$\text{if } \varepsilon \ge \frac{C''_{\mathcal H}}{h^2}, \text{ then } \mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon \ge \mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho) + \frac{C''_{\mathcal H}}{h^2} \ge -G'_+(0)\, |||f - f_\rho|||^2_{L^2_{\rho_X}}. \tag{4.3}$$

While this inequality is an easy consequence of Corollary 2, its special role in our estimation
of $S_1$ needs to be demonstrated in detail in the following proof.

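Lemma 4. Under conditions of Corollary 2, if $G'_+(0) < 0$, $\|tG'(t^2/2)\|_\infty < \infty$ and

$$\varepsilon \ge \frac{C''_{\mathcal H}}{h^2}, \tag{4.4}$$

then we have

$$\mathrm{Prob}\left\{ \sup_{f \in \mathcal{H}} \frac{\left[\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right]}{\sqrt{\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}} > 4\sqrt{\varepsilon} \right\}$$

$$\le \mathcal{N}\left(\mathcal{H}, \frac{\varepsilon}{2\|tG'(t^2/2)\|_\infty h}\right) \exp\left\{ -\frac{(m-1)\varepsilon}{16\|tG'(t^2/2)\|^2_\infty h^2/(-G'_+(0)) + 6\|tG'(t^2/2)\|_\infty \sup_{f \in \mathcal{H}} \|f - f_\rho\|_\infty h} \right\}.$$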
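Proof. From the argument in the proof of Lemma 3, we see that for any $f, g \in \mathcal{H}$, we have

$$|\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(g)| \le 2\|tG'(t^2/2)\|_\infty \|f - g\|_\infty h$$

and almost surely

$$|\mathcal{E}^{(h)}_{\mathbf z}(f) - \mathcal{E}^{(h)}_{\mathbf z}(g)| \le 2\|tG'(t^2/2)\|_\infty \|f - g\|_\infty h.$$

Thus, if $\|f - f_j\|_\infty \le \frac{\varepsilon}{2\|tG'(t^2/2)\|_\infty h}$, then

$$\frac{\left[\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right]}{\sqrt{\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}} > 4\sqrt{\varepsilon}$$

implies

$$\frac{\left[\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f_j) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right]}{\sqrt{\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}} > \sqrt{\varepsilon}.$$

Thus by taking $\{f_j\}_{j=1}^N$ to be an $\frac{\varepsilon}{2\|tG'(t^2/2)\|_\infty h}$-net of the set $\mathcal{H}$ with $N$ being the covering
number $\mathcal{N}\left(\mathcal{H}, \frac{\varepsilon}{2\|tG'(t^2/2)\|_\infty h}\right)$, we find

$$\mathrm{Prob}\left\{ \sup_{f \in \mathcal{H}} \frac{\left[\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right]}{\sqrt{\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}} > 4\sqrt{\varepsilon} \right\}$$

$$\le \mathrm{Prob}\left\{ \sup_{j=1,\ldots,N} \frac{\left[\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f_j) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right]}{\sqrt{\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}} > \sqrt{\varepsilon} \right\}$$

$$\le \sum_{j=1}^{N} \mathrm{Prob}\left\{ \frac{\left[\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f_j) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right]}{\sqrt{\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}} > \sqrt{\varepsilon} \right\}.$$

Fix $j \in \{1, \ldots, N\}$. Consider the function

$$U(z, z') = h^2 G\left(\frac{[(y - f_j(x)) - (v - f_j(u))]^2}{2h^2}\right) - h^2 G\left(\frac{[(y - f_\rho(x)) - (v - f_\rho(u))]^2}{2h^2}\right).$$

It satisfies

$$|U(z, z')| \le 2\|tG'(t^2/2)\|_\infty \|f_j - f_\rho\|_\infty h$$

almost surely and

$$\mathrm{var}(U) \le 4\|tG'(t^2/2)\|^2_\infty\, |||f_j - f_\rho|||^2_{L^2_{\rho_X}} h^2.$$

Also, $-\mathbb{E}U = \mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho)$ and $\frac{1}{m(m-1)} \sum_{i=1}^m \sum_{k \ne i} U(z_i, z_k) = -\left\{ \mathcal{E}^{(h)}_{\mathbf z}(f_j) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho) \right\}$.
Set

$$\tilde{\varepsilon} = \sqrt{\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}. \tag{4.5}$$

By Lemma 2 we find

$$\mathrm{Prob}\left\{ \frac{\left[\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f_j) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right]}{\sqrt{\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}} > \sqrt{\varepsilon} \right\} = \mathrm{Prob}\left\{ \left[\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f_j) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right] > \tilde{\varepsilon}\sqrt{\varepsilon} \right\}$$

$$\le \exp\left\{ -\frac{(m-1)\tilde{\varepsilon}^2 \varepsilon}{16\|tG'(t^2/2)\|^2_\infty\, |||f_j - f_\rho|||^2_{L^2_{\rho_X}} h^2 + 6\|tG'(t^2/2)\|_\infty \|f_j - f_\rho\|_\infty h \tilde{\varepsilon}\sqrt{\varepsilon}} \right\}.$$

Now we apply the important relation (4.3) to the function $f = f_j$ and find by noting the
definition (4.5) for $\tilde{\varepsilon}$ that

$$|||f_j - f_\rho|||^2_{L^2_{\rho_X}} \le \frac{\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}{-G'_+(0)} = \frac{\tilde{\varepsilon}^2}{-G'_+(0)}.$$

This together with the inequalities $\frac{\sqrt{\varepsilon}}{\tilde{\varepsilon}} \le 1$ and $\|f_j - f_\rho\|_\infty \le \sup_{f \in \mathcal{H}} \|f - f_\rho\|_\infty$ gives

$$\mathrm{Prob}\left\{ \frac{\left[\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f_j) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right]}{\sqrt{\mathcal{E}^{(h)}(f_j) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}} > \sqrt{\varepsilon} \right\}$$

$$\le \exp\left\{ -\frac{(m-1)\varepsilon}{16\|tG'(t^2/2)\|^2_\infty h^2/(-G'_+(0)) + 6\|tG'(t^2/2)\|_\infty \sup_{f \in \mathcal{H}} \|f - f_\rho\|_\infty h} \right\}.$$

Therefore,

$$\mathrm{Prob}\left\{ \sup_{f \in \mathcal{H}} \frac{\left[\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right]}{\sqrt{\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon}} > 4\sqrt{\varepsilon} \right\}$$

$$\le N \exp\left\{ -\frac{(m-1)\varepsilon}{16\|tG'(t^2/2)\|^2_\infty h^2/(-G'_+(0)) + 6\|tG'(t^2/2)\|_\infty \sup_{f \in \mathcal{H}} \|f - f_\rho\|_\infty h} \right\}.$$

This proves the required inequality. □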

We are in a position to bound $S_1$ and hence the sample error.

Proposition 1. Under conditions of Corollary 2, if $G'_+(0) < 0$, $\|tG'(t^2/2)\|_\infty < \infty$ and
covering number condition (1.4) is satisfied, then with confidence $1 - \delta$, we have

£ w {f z ) -£ {h) {fn) <C n ,G, P ^<h 2 , flogT, , T , 1 

[ m-1 b Vm-1 ( m _i)T+iJ 

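where $C_{\mathcal{H},G,\rho}$ is a constant independent of $m$, $\delta$ or $h$.

Proof. By condition (1.4), the probability bound in Lemma 4 is at most

$$\exp\left\{ A_p \left(\frac{2\|tG'(t^2/2)\|_\infty h}{\varepsilon}\right)^p - \frac{(m-1)\varepsilon}{C_h} \right\},$$

where

$$C_h = 16\|tG'(t^2/2)\|^2_\infty h^2/(-G'_+(0)) + 6\|tG'(t^2/2)\|_\infty \sup_{f \in \mathcal{H}} \|f - f_\rho\|_\infty h.$$

Requiring this bound to be at most $\delta/2$ is equivalent to the inequality

$$\varepsilon^{1+p} - \frac{C_h}{m-1} \log\frac{2}{\delta}\, \varepsilon^p - \frac{A_p \left(2\|tG'(t^2/2)\|_\infty h\right)^p C_h}{m-1} \ge 0.$$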

By Lemma 7.2 in [1], we know that the above inequality is satisfied as long as 

2C h , 2 



e > max 



m 



Mog^, (4, (2||tG'(t 2 /2)||oo^) P C h ) 1/{1+p) (m - 



Now let us take a constant 

C£ = Cn + 32||tG'(t 2 /2)|| 2 J(-G' + (0)) + 12||tG'(t 2 /2)|U sup ||/ - /J*, + WA 1 ^ 

fen 

( ||tG'(t 2 /2)||^)/( 1+ rt(-G' + (0))- 1 /( 1+ ^ + ||tG'(t 2 /2)|| 00 sup ||/ - /J^A 
V fen J 

and take e to be given by 

e* =C^max(/i- 2 , ^-±^lo g H (ft + /^/^(m - 1)^1 . 

{ m — 1 o J 

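With this choice, by Lemma 4 we know that with confidence at least $1 - \delta/2$, we have

$$\sup_{f \in \mathcal{H}} \frac{\left[\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right]}{\sqrt{\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon^*}} \le 4\sqrt{\varepsilon^*},$$

which implies in particular

$$\left[\mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_\rho)\right] - \left[\mathcal{E}^{(h)}_{\mathbf z}(f_{\mathbf z}) - \mathcal{E}^{(h)}_{\mathbf z}(f_\rho)\right] \le 4\sqrt{\varepsilon^*} \sqrt{\mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_\rho) + \varepsilon^*}.$$

This together with the elementary inequality $4\sqrt{a}\sqrt{b} \le \frac{a}{2} + 8b$ implies that

$$S_1 \le \frac{1}{2} \left( \mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_\rho) \right) + 12\varepsilon^*.$$

We combine this estimate with Lemma 3 and (4.1), and see that with confidence $1 - \delta$,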

2 , f 6 . 4111/^-^111^ 



+||tG / (t72)|| 00 log-A 



5 I m-1 vm—T 

and hence 



£<*>(/.)-£<«(/„) < 24e* + 11^(^/2)^ log -/i 



2 f 12 8|||/h - /p||| L 2 



5 I m — 1 — 1 

By setting the constant CV^g,p as 

C n , GtP = 24C£ + 20\\tG'(t 2 /2)\\ oo , 
we verify the conclusion of Proposition [TJ □ 

5 Proof of the Error Bounds 

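We are now in a position to prove our error bounds for algorithm (1.1) stated in the
introduction.

Proof of Theorem 1. Since $G'_+(0) = -1$ and $|||f_{approx} - f_\rho|||^2_{L^2_{\rho_X}} = \mathcal{D}_{\mathcal H}(f_\rho)$, by Lemma 1 we
have

$$|||f_{\mathbf z} - f_\rho|||^2_{L^2_{\rho_X}} \le \left\{ \mathcal{E}^{(h)}(f_{\mathbf z}) - \mathcal{E}^{(h)}(f_{\mathcal H}) \right\} + \mathcal{D}_{\mathcal H}(f_\rho) + \frac{2C''_{\mathcal H}}{h^2}.$$

Combining this with the bound in Proposition 1 for the sample error and the restriction
$h \ge 1$ tells us that with confidence $1 - \delta$,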

„, , lll2 ^_ f._ 2 4A 2 , 2 2|||/„-/ p ||Uj h _ ,., 2CJ 

IIIA-ZpIII^ <CW« ft », -log-, l +DK (/,) + 1 Jt. 



But 

2|||/«-/,IIU; x ft cfc^ 

^h,g,p >= S VWIjh - JpWIli + 



and according to Theorem 4, $|||f_{\mathcal H} - f_\rho|||^2_{L^2_{\rho_X}} \le \mathcal{D}_{\mathcal H}(f_\rho) + \frac{2C''_{\mathcal H}}{h^2}$. It follows that with confidence
$1 - \delta$



Mr mm- I i —2 ^Vo! 2 , C 'w,G,p\ 4/114* 

|/z-/p||| L 2 < C WiGiP max < h , — 21og- 1 



p x m \ o 7] 

(2 + 2rj)C? l 



) i 
m i+ P 



+(i + v )v n (f p ) + 



h 2 



C n ( 1 /i 2 fc^ \ , 2 



where 



$$\widetilde{C}_{\mathcal H} = \max\left\{ C_{\mathcal{H},G,\rho} + 4C''_{\mathcal H},\; C_{\mathcal{H},G,\rho}\left(2 + C_{\mathcal{H},G,\rho}\right),\; 4C_{\mathcal{H},G,\rho} \right\}.$$

This proves (1.5) and thereby Theorem 1. □